Query Join Processing Over Uncertain Data for Decision Tree Classifiers
Authors
Abstract
Traditional decision tree classifiers work with data whose values are known and precise. These classifiers can be extended to handle data with uncertain information. Value uncertainty arises in many applications during the data collection process. Example sources of uncertainty include measurement/quantization errors, data staleness, and multiple repeated measurements. Rather than abstracting uncertain data by statistical derivatives such as the mean and median, the accuracy of a decision tree classifier can be improved considerably if the complete information of a data item is used, by way of its Probability Density Function (PDF). In particular, an attribute value can be modelled as a range of possible values associated with a PDF. Work based on PDFs has so far addressed only simple queries, such as range and nearest-neighbour queries; queries that join multiple relations have not been addressed. Given the significance of joins in databases, we address join queries over uncertain data. We propose semantics for the join operation, define probabilistic operators over uncertain data, and propose join algorithms that execute probabilistic joins efficiently, in particular probabilistic threshold joins, which avoid the semantic complexities of dealing with uncertain data. For this class of joins we develop three sets of optimization techniques: item-level, page-level, and index-level pruning. We compare the performance of these techniques experimentally.
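As an illustration of the ideas in the abstract, the following is a minimal Python sketch of a probabilistic threshold join with an item-level pruning shortcut. It is not the authors' algorithm. It assumes each uncertain value is a discrete PDF (for example, from repeated measurements), that values are independent, and that the join predicate is "less than". The names `UncertainValue`, `prob_less_than`, and `threshold_join` are hypothetical and chosen only for this example.

```python
from itertools import product


class UncertainValue:
    """An uncertain attribute value modelled as a discrete PDF (assumption)."""

    def __init__(self, samples):
        # samples: list of (value, probability) pairs; probabilities must sum to 1.
        total = sum(p for _, p in samples)
        if abs(total - 1.0) > 1e-9:
            raise ValueError("probabilities must sum to 1")
        self.samples = sorted(samples)
        self.lo = self.samples[0][0]   # smallest possible value
        self.hi = self.samples[-1][0]  # largest possible value


def prob_less_than(a, b):
    """Return P(a < b), assuming a and b are independent."""
    # Item-level shortcuts: the bounds alone decide the answer,
    # so the PDFs do not need to be examined.
    if a.hi < b.lo:
        return 1.0
    if a.lo >= b.hi:
        return 0.0
    # Otherwise sum the joint probability of every pair that satisfies a < b.
    return sum(pa * pb
               for (va, pa), (vb, pb) in product(a.samples, b.samples)
               if va < vb)


def threshold_join(r, s, threshold):
    """Nested-loop join keeping pairs whose match probability meets the threshold."""
    result = []
    for r_key, r_val in r:
        for s_key, s_val in s:
            p = prob_less_than(r_val, s_val)
            if p >= threshold:
                result.append((r_key, s_key, p))
    return result


R = [("r1", UncertainValue([(1, 0.5), (3, 0.5)]))]
S = [("s1", UncertainValue([(2, 0.5), (4, 0.5)])),
     ("s2", UncertainValue([(0, 1.0)]))]
matches = threshold_join(R, S, threshold=0.7)
```

Here the pair (`r1`, `s2`) is resolved from the value bounds alone, while (`r1`, `s1`) requires combining the two PDFs. Page-level and index-level pruning, which the abstract also mentions, are not shown in this sketch.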
Similar resources
Extending Decision Tree Classifiers for Uncertain Data
Traditionally, decision tree classifiers work with data whose values are known and precise. We extend such classifiers to handle data with uncertain information. Value uncertainty arises in many applications during the data collection process. Example sources of uncertainty include measurement/quantization errors, data staleness, and multiple repeated measurements. With uncertainty, the value o...
Optimizing Probabilistic Query Processing on Continuous Uncertain Data
Uncertain data management is becoming increasingly important in many applications, in particular, in scientific databases and data stream systems. Uncertain data in these new environments is naturally modeled by continuous random variables. An important class of queries uses complex selection and join predicates and requires query answers to be returned if their existence probabilities pass a t...
Chapter 10 INDEXING UNCERTAIN DATA
As the volume of uncertain data increases, the cost of evaluating queries over this data will also increase. In order to scale uncertain databases to large data volumes, efficient query processing methods are needed. One of the key techniques for efficient query evaluation is indexing. Due to the nature of uncertain data and queries over this data, existing indexing solutions for precise data a...
Fast Reachability Query Processing
Graph has great expressive power to describe the complex relationships among data objects, and there are large graph datasets available. In this paper, we focus ourselves on processing a primitive graph query. We call it reachability query. The reachability query, denoted A D, is to find all elements of a type D that are reachable from some elements in another type A. The problem is challenging...
Scalable Statistical Modeling and Query Processing over Large Scale Uncertain Databases
Title of Dissertation: SCALABLE STATISTICAL MODELING AND QUERY PROCESSING OVER LARGE SCALE UNCERTAIN DATABASES Bhargav Kanagal Shamanna Doctor of Philosophy, 2011 Dissertation directed by: Dr. Amol Deshpande Dept. of Computer Science The past decade has witnessed a large number of novel applications that generate imprecise, uncertain and incomplete data. Examples include monitoring infrastructu...